GROUP NO - 38

Names of the group members:

PRANAV JADHAV (0774109)

SAHAJ PATEL (0774578)

GATI SONANI (0779956)

HARSH TRIVEDI (0788765)

PUNAM DESAI (0785752)

URVASHI PRAJAPATI (0785750)

Our Project represents our own work and we have adhered to St. Clair College’s Academic Integrity policies in completing this project.
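The report does not show how `ames_numeric` was created. A minimal sketch of one plausible setup follows, assuming (this is not confirmed by the source) that it is the numeric subset of the `ames` data shipped with the modeldata package, which appears in the session info at the end of the report.

```r
# Assumption: ames_numeric is the numeric-only subset of modeldata's ames data.
# The report itself does not show this step.
library(dplyr)

data(ames, package = "modeldata")

ames_numeric <- ames %>%
  select(where(is.numeric))   # keep only integer and double columns

dim(ames_numeric)
```

If the column count differs from the 34 columns printed in the report, the original selection was done differently and this sketch should be adjusted.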

## # A tibble: 2,930 x 34
##    Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
##           <dbl>    <int>      <int>          <int>        <dbl>        <dbl>
##  1          141    31770       1960           1960          112            2
##  2           80    11622       1961           1961            0            6
##  3           81    14267       1958           1958          108            1
##  4           93    11160       1968           1968            0            1
##  5           74    13830       1997           1998            0            3
##  6           78     9978       1998           1998           20            3
##  7           41     4920       2001           2001            0            3
##  8           43     5005       1992           1992            0            1
##  9           39     5389       1995           1996            0            3
## 10           60     7500       1999           1999            0            7
## # ... with 2,920 more rows, and 28 more variables: BsmtFin_SF_2 <dbl>,
## #   Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, First_Flr_SF <int>,
## #   Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
## #   Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>,
## #   Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>,
## #   Fireplaces <int>, Garage_Cars <dbl>, Garage_Area <dbl>, Wood_Deck_SF <int>,
## #   Open_Porch_SF <int>, Enclosed_Porch <int>, Three_season_porch <int>,
## #   Screen_Porch <int>, Pool_Area <int>, Misc_Val <int>, Mo_Sold <int>,
## #   Year_Sold <int>, Sale_Price <int>, Longitude <dbl>, Latitude <dbl>
skim(ames_numeric)
Data summary
Name ames_numeric
Number of rows 2930
Number of columns 34
_______________________
Column type frequency:
numeric 34
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Lot_Frontage 0 1 57.65 33.50 0.00 43.00 63.00 78.00 313.00 ▇▇▁▁▁
Lot_Area 0 1 10147.92 7880.02 1300.00 7440.25 9436.50 11555.25 215245.00 ▇▁▁▁▁
Year_Built 0 1 1971.36 30.25 1872.00 1954.00 1973.00 2001.00 2010.00 ▁▂▃▆▇
Year_Remod_Add 0 1 1984.27 20.86 1950.00 1965.00 1993.00 2004.00 2010.00 ▅▂▂▃▇
Mas_Vnr_Area 0 1 101.10 178.63 0.00 0.00 0.00 162.75 1600.00 ▇▁▁▁▁
BsmtFin_SF_1 0 1 4.18 2.23 0.00 3.00 3.00 7.00 7.00 ▃▂▇▁▇
BsmtFin_SF_2 0 1 49.71 169.14 0.00 0.00 0.00 0.00 1526.00 ▇▁▁▁▁
Bsmt_Unf_SF 0 1 559.07 439.54 0.00 219.00 465.50 801.75 2336.00 ▇▅▂▁▁
Total_Bsmt_SF 0 1 1051.26 440.97 0.00 793.00 990.00 1301.50 6110.00 ▇▃▁▁▁
First_Flr_SF 0 1 1159.56 391.89 334.00 876.25 1084.00 1384.00 5095.00 ▇▃▁▁▁
Second_Flr_SF 0 1 335.46 428.40 0.00 0.00 0.00 703.75 2065.00 ▇▃▂▁▁
Gr_Liv_Area 0 1 1499.69 505.51 334.00 1126.00 1442.00 1742.75 5642.00 ▇▇▁▁▁
Bsmt_Full_Bath 0 1 0.43 0.52 0.00 0.00 0.00 1.00 3.00 ▇▆▁▁▁
Bsmt_Half_Bath 0 1 0.06 0.25 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
Full_Bath 0 1 1.57 0.55 0.00 1.00 2.00 2.00 4.00 ▁▇▇▁▁
Half_Bath 0 1 0.38 0.50 0.00 0.00 0.00 1.00 2.00 ▇▁▅▁▁
Bedroom_AbvGr 0 1 2.85 0.83 0.00 2.00 3.00 3.00 8.00 ▁▇▂▁▁
Kitchen_AbvGr 0 1 1.04 0.21 0.00 1.00 1.00 1.00 3.00 ▁▇▁▁▁
TotRms_AbvGrd 0 1 6.44 1.57 2.00 5.00 6.00 7.00 15.00 ▁▇▂▁▁
Fireplaces 0 1 0.60 0.65 0.00 0.00 1.00 1.00 4.00 ▇▇▁▁▁
Garage_Cars 0 1 1.77 0.76 0.00 1.00 2.00 2.00 5.00 ▅▇▂▁▁
Garage_Area 0 1 472.66 215.19 0.00 320.00 480.00 576.00 1488.00 ▃▇▃▁▁
Wood_Deck_SF 0 1 93.75 126.36 0.00 0.00 0.00 168.00 1424.00 ▇▁▁▁▁
Open_Porch_SF 0 1 47.53 67.48 0.00 0.00 27.00 70.00 742.00 ▇▁▁▁▁
Enclosed_Porch 0 1 23.01 64.14 0.00 0.00 0.00 0.00 1012.00 ▇▁▁▁▁
Three_season_porch 0 1 2.59 25.14 0.00 0.00 0.00 0.00 508.00 ▇▁▁▁▁
Screen_Porch 0 1 16.00 56.09 0.00 0.00 0.00 0.00 576.00 ▇▁▁▁▁
Pool_Area 0 1 2.24 35.60 0.00 0.00 0.00 0.00 800.00 ▇▁▁▁▁
Misc_Val 0 1 50.64 566.34 0.00 0.00 0.00 0.00 17000.00 ▇▁▁▁▁
Mo_Sold 0 1 6.22 2.71 1.00 4.00 6.00 8.00 12.00 ▅▆▇▃▃
Year_Sold 0 1 2007.79 1.32 2006.00 2007.00 2008.00 2009.00 2010.00 ▇▇▇▇▃
Sale_Price 0 1 180796.06 79886.69 12789.00 129500.00 160000.00 213500.00 755000.00 ▇▇▁▁▁
Longitude 0 1 -93.64 0.03 -93.69 -93.66 -93.64 -93.62 -93.58 ▅▅▇▆▁
Latitude 0 1 42.03 0.02 41.99 42.02 42.03 42.05 42.06 ▂▂▇▇▇
glimpse(ames_numeric)
## Rows: 2,930
## Columns: 34
## $ Lot_Frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,~
## $ Lot_Area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005~
## $ Year_Built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199~
## $ Year_Remod_Add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199~
## $ Mas_Vnr_Area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6~
## $ BsmtFin_SF_1       <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, ~
## $ BsmtFin_SF_2       <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0~
## $ Bsmt_Unf_SF        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,~
## $ Total_Bsmt_SF      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, ~
## $ First_Flr_SF       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, ~
## $ Second_Flr_SF      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,~
## $ Gr_Liv_Area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616~
## $ Bsmt_Full_Bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, ~
## $ Bsmt_Half_Bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Full_Bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, ~
## $ Half_Bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, ~
## $ Bedroom_AbvGr      <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, ~
## $ Kitchen_AbvGr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotRms_AbvGrd      <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,~
## $ Fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, ~
## $ Garage_Cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, ~
## $ Garage_Area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4~
## $ Wood_Deck_SF       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48~
## $ Open_Porch_SF      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0~
## $ Enclosed_Porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Screen_Porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, ~
## $ Pool_Area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Misc_Val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, ~
## $ Mo_Sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, ~
## $ Year_Sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201~
## $ Sale_Price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213~
## $ Longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638~
## $ Latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4~

Exploratory Data Analysis


Univariate plots

1)Year Sold

ggplot(ames_numeric,aes(Year_Sold)) + geom_bar(fill="sky blue", color="black",width = 0.5)+
  labs(title="Year-wise distribution of houses sold",x="Year sold")

#Description

The plot shows the number of houses sold in each year. In most years more than 600 houses were sold, but in 2010 the count dropped to about 375.

2)Year Built

a<-ggplot(ames_numeric,aes(Year_Built)) +geom_histogram(fill="orange", color="black", binwidth = 10)+
  labs(title="Year-wise distribution of houses built",x="Year built") + coord_flip()
ggplotly(a)

#Description The plot shows the distribution of houses by year built, in 10-year bins. Most houses were built around 2000, and the fewest were built in or before 1900.

3)Garage Cars

ggplot(ames_numeric) + geom_bar(mapping = aes(Garage_Cars),fill="orange", color="black")+
  labs(title="Distribution of garage capacity",x="Garage cars")

#Description The plot shows the distribution of garage capacity, measured in cars. Most houses have a 2-car garage.

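4)Sale Price

ggplot(ames_numeric)+ geom_histogram(mapping = aes(Sale_Price),fill="light green", color="white")+
  labs(title="Distribution of sale price",x="Sale price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Description The plot shows the distribution of house sale prices. The distribution is right-skewed: the middle half of the houses sold for between about 129,500 and 213,500, with a long tail of more expensive houses reaching 755,000.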

5)Above-ground Living Area

# binwidth was previously placed inside aes(), where it is ignored,
# so the histogram uses the default 30 bins either way
abc <- ggplot(ames_numeric, aes(Gr_Liv_Area)) +
  geom_histogram(fill = "blue", color = "red") +
  labs(title="Houses by above-ground living area",x="Above-ground living area")
ggplotly(abc)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Description The plot shows the distribution of above-ground living area. Most houses have between 900 and 1600 square feet of above-ground living area.

Bivariate plots


ggplot(ames_numeric) + geom_point(mapping=aes(Year_Built,Sale_Price))+ geom_smooth(mapping=aes(Year_Built,Sale_Price))+
  labs(title="Sale price based on year built",x="Year built",y="Sale price")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

#Description The plot shows sale price against year built, with a smoothed trend line. The trend is roughly flat for houses built between 1900 and 1940 and rises gradually for houses built after 1940.

# outlier.colour is the geom_boxplot() argument for coloring outliers
Boxplot <- ames_numeric %>% select(Year_Sold , Sale_Price) %>% ggplot() + geom_boxplot(aes(Year_Sold , Sale_Price,group=Year_Sold, fill=Year_Sold), outlier.colour="red")+
  labs(title="Sale price according to year sold",x="Year sold",y="Sale price")
Boxplot

#Description The box plot shows the distribution of sale price for each year of sale.

mh<- ggplot(ames_numeric)+
  geom_point(mapping = aes(x=Gr_Liv_Area,y=Year_Built),color = "Green") +
  labs(title="Above-ground living area by year built",x="Above-ground living area",y="Year built")
ggplotly(mh)

#Description

The plot shows year built against above-ground living area. Most houses have between 1000 and 2000 square feet of above-ground living area.

Ah<- ggplot(ames_numeric)+
  geom_point(mapping = aes(x=Lot_Frontage,y=Sale_Price)) +
  labs(title="Sale price according to lot frontage",x="Lot frontage",y="Sale price")
ggplotly(Ah)

#Description The plot shows sale price against lot frontage. Many houses have a recorded lot frontage of 0, which is why a vertical band of points appears at 0 across a wide range of sale prices.

 ggplot(ames_numeric)+
  geom_point(mapping = aes(x=Lot_Area,y=Sale_Price)) +
  labs(title="Sale price according to lot area",x="Lot area",y="Sale price")

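#Description The plot shows sale price against lot area. Most lots are small relative to the full range of the axis: according to the data summary, the middle half of lot areas lies between about 7,400 and 11,600 square feet, while the largest lot is 215,245 square feet.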

Multivariate

ggplot(ames_numeric) + geom_point(mapping=aes(Year_Built,Sale_Price ,color = Year_Sold ))+
  labs(title="Sale price across all the built year ",x="Year built",y="Sale price")

#Description The plot shows sale price against year built, with points colored by year sold. Houses built around 1980 and later sold for more than houses built between 1900 and 1940.

plotvar <- ames_numeric$Sale_Price # pick a variable to plot
nclr <- 8 # number of colors
plotclr <- brewer.pal(nclr,"PuBu") # get the colors
colornum <- cut(rank(plotvar), nclr, labels=FALSE)
colcode <- plotclr[colornum] # assign color

# scatter plot
plot.angle <- 45
scatterplot3d(ames_numeric$Lot_Frontage, ames_numeric$Gr_Liv_Area, plotvar, type="h", angle=plot.angle, color=colcode, pch=20, cex.symbols=2, 
  col.axis="gray", col.grid="gray")


Summary

ames_numeric %>% 
  group_by(Year_Sold) %>% 
  summarize(Mean_Sales_Price = mean(Sale_Price))
## # A tibble: 5 x 2
##   Year_Sold Mean_Sales_Price
##       <int>            <dbl>
## 1      2006          181762.
## 2      2007          185138.
## 3      2008          178842.
## 4      2009          181405.
## 5      2010          172598.
ames_numeric %>% 
  group_by(Year_Sold) %>% summarize(Mean_Garage_cars = mean(Garage_Cars))
## # A tibble: 5 x 2
##   Year_Sold Mean_Garage_cars
##       <int>            <dbl>
## 1      2006             1.77
## 2      2007             1.80
## 3      2008             1.73
## 4      2009             1.81
## 5      2010             1.67


Response Variable

Sale_Price is the response variable in the ames_numeric dataset. Its distribution is right-skewed rather than normal, so we transform it with the log10 function.

Using Sale_Price as the response variable

ggplot(ames_numeric)+geom_histogram(aes(Sale_Price),fill="orange", color="black")+
  labs(title="Response variable without transformation",x="Sale price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(ames_numeric)+geom_histogram(aes(log10(Sale_Price)),fill="orange", color="black")+
  labs(title="Response variable with transformation",x="log10(Sale price)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
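One way to put a number on the skew that motivates the log10 transformation is the sample skewness, which is near 0 for a symmetric distribution and positive for a right-skewed one. The sketch below uses simulated right-skewed values as a hypothetical stand-in for `Sale_Price`; the `skewness` helper and the simulated data are illustrative assumptions, not part of the original analysis.

```r
# Sample skewness: third central moment divided by the variance to the power 1.5.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical stand-in for Sale_Price: strictly positive, right-skewed values.
set.seed(38)
price <- rlnorm(2930, meanlog = log(160000), sdlog = 0.4)

skew_raw <- skewness(price)          # positive for right-skewed data
skew_log <- skewness(log10(price))   # much closer to 0 after the transformation

c(raw = skew_raw, log10 = skew_log)
```

Applying the same helper to `ames_numeric$Sale_Price` and to `log10(ames_numeric$Sale_Price)` would show how much of the skew the transformation removes in the real data.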

Modeling

Model 1: Sale price by garage area

m1 <- lm(Sale_Price ~ Garage_Area, data = ames_numeric)
summary(m1)
## 
## Call:
## lm(formula = Sale_Price ~ Garage_Area, data = ames_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -284053  -33609   -5318   25286  488808 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68470.351   2737.269   25.01   <2e-16 ***
## Garage_Area   237.647      5.271   45.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61380 on 2928 degrees of freedom
## Multiple R-squared:  0.4098, Adjusted R-squared:  0.4096 
## F-statistic:  2033 on 1 and 2928 DF,  p-value: < 2.2e-16
Equation_Model_1<-extract_eq(m1, use_coefs = TRUE)
Equation_Model_1

\[ \operatorname{\widehat{Sale\_Price}} = 68470.35 + 237.65(\operatorname{Garage\_Area}) \]

Model 2: Sale price by above-ground living area

m2 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames_numeric)
summary(m2)
## 
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483467  -30219   -1966   22728  334323 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13289.634   3269.703   4.064 4.94e-05 ***
## Gr_Liv_Area   111.694      2.066  54.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared:  0.4995, Adjusted R-squared:  0.4994 
## F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16
Equation_Model_2<-extract_eq(m2, use_coefs = TRUE)
Equation_Model_2

\[ \operatorname{\widehat{Sale\_Price}} = 13289.63 + 111.69(\operatorname{Gr\_Liv\_Area}) \]
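As a quick check on how to read this equation, the sketch below plugs one living area into it by hand. The coefficients are copied from the `summary(m2)` output; the `predict_price` helper is a hypothetical name introduced only for illustration.

```r
# Coefficients copied from the summary(m2) output.
intercept <- 13289.634
slope     <- 111.694   # change in predicted Sale_Price per extra square foot

# Hypothetical helper: predicted sale price for a given Gr_Liv_Area.
predict_price <- function(gr_liv_area) intercept + slope * gr_liv_area

# First house in the data has Gr_Liv_Area = 1656.
predict_price(1656)
```

The result, about 198,255, agrees with the `.fitted` value reported for the first row in the Model Diagnostics section. With the fitted model in memory, `predict(m2, newdata = data.frame(Gr_Liv_Area = 1656))` should give the same number.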

Model 3: Sale price by lot area

m3 <- lm(Sale_Price ~ Lot_Area, data = ames_numeric)
summary(m3)
## 
## Call:
## lm(formula = Sale_Price ~ Lot_Area, data = ames_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -369375  -47827  -18982   31261  549409 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.534e+05  2.320e+03   66.11   <2e-16 ***
## Lot_Area    2.702e+00  1.806e-01   14.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77010 on 2928 degrees of freedom
## Multiple R-squared:  0.07105,    Adjusted R-squared:  0.07073 
## F-statistic: 223.9 on 1 and 2928 DF,  p-value: < 2.2e-16
Equation_Model_3<-extract_eq(m3, use_coefs = TRUE)
Equation_Model_3

\[ \operatorname{\widehat{Sale\_Price}} = 153373.89 + 2.7(\operatorname{Lot\_Area}) \]

Model Assessment

library(modelsummary)


models <- list(
  "m1" = lm(Sale_Price ~ Garage_Area, data = ames_numeric),
  "m2" = lm(Sale_Price ~ Gr_Liv_Area, data = ames_numeric),
  "m3" = lm(Sale_Price ~ Lot_Area, data = ames_numeric)
)
modelsummary(models)
              m1           m2           m3
(Intercept)   68470.351    13289.634    153373.893
              (2737.269)   (3269.703)   (2319.909)
Garage_Area   237.647
              (5.271)
Gr_Liv_Area                111.694
                           (2.066)
Lot_Area                                2.702
                                        (0.181)
Num.Obs.      2930         2930         2930
R2            0.410        0.500        0.071
R2 Adj.       0.410        0.499        0.071
AIC           72924.9      72441.6      74253.9
BIC           72942.9      72459.5      74271.8
Log.Lik.      -36459.470   -36217.791   -37123.929
F             2032.837     2922.592     223.941

#Description We compare the three models using R2, the proportion of the variance in sale price that the model explains; a higher R2 indicates a better fit to the observed data. Model 2 (m2) has the highest R2, 0.5, which means that above-ground living area explains about 50% of the variance in sale price. m2 also has the lowest AIC and BIC of the three models.
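To make the meaning of R2 concrete, the sketch below computes it by hand as one minus the ratio of residual to total sum of squares and compares it with the value reported by `summary()`. The data are simulated and the variable names are hypothetical stand-ins for `Gr_Liv_Area` and `Sale_Price`.

```r
# Hypothetical data standing in for (Gr_Liv_Area, Sale_Price).
set.seed(38)
area  <- runif(500, min = 500, max = 3000)
price <- 13000 + 110 * area + rnorm(500, sd = 55000)

fit <- lm(price ~ area)

ss_res <- sum(residuals(fit)^2)          # variation left unexplained by the model
ss_tot <- sum((price - mean(price))^2)   # total variation in the response
r2_manual <- 1 - ss_res / ss_tot

# The hand computation matches the R-squared reported by summary().
all.equal(r2_manual, summary(fit)$r.squared)
```

The same calculation on `m2` uses `residuals(m2)` and `ames_numeric$Sale_Price` in place of the simulated values.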

Model Diagnostics

model_diagnostics<-augment(m2)
model_diagnostics<- model_diagnostics %>% round(6)
model_diagnostics
## # A tibble: 2,930 x 8
##    Sale_Price Gr_Liv_Area .fitted  .resid     .hat .sigma  .cooksd .std.resid
##         <dbl>       <dbl>   <dbl>   <dbl>    <dbl>  <dbl>    <dbl>      <dbl>
##  1     215000        1656 198255.  16745. 0.000374 56533. 0.000016     0.296 
##  2     105000         896 113367.  -8367. 0.000828 56534. 0.000009    -0.148 
##  3     172000        1329 161731.  10269. 0.00038  56534. 0.000006     0.182 
##  4     244000        2110 248964.  -4964. 0.000839 56534. 0.000003    -0.0879
##  5     189900        1629 195239.  -5339. 0.000364 56534. 0.000002    -0.0945
##  6     195500        1604 192447.   3053. 0.000356 56534. 0.000001     0.0540
##  7     213500        1338 162736.  50764. 0.000376 56526. 0.000152     0.898 
##  8     191500        1280 156258.  35242. 0.000406 56530. 0.000079     0.624 
##  9     236500        1616 193787.  42713. 0.000359 56528. 0.000103     0.756 
## 10     189000        1804 214786. -25786. 0.000465 56532. 0.000048    -0.456 
## # ... with 2,920 more rows
ggplot(model_diagnostics, aes(Gr_Liv_Area,Sale_Price)) + geom_point() + geom_line(aes(Gr_Liv_Area,.fitted),color="blue") + geom_segment(data = model_diagnostics %>% slice_sample(n = 30),aes(x=Gr_Liv_Area,y=Sale_Price,xend=Gr_Liv_Area, yend=.fitted), color="red")+
  labs(title="Best-fit model showing predicted and observed sale price along with residuals",x="Above-ground living area",y="Sale price")

1)Linearity of data.

ggplot(data = model_diagnostics, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

#Description From the plot of residuals against fitted values, we conclude that the relationship is roughly linear.

2)normality of residuals.

ggplot(data = model_diagnostics, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Description The histogram suggests that the residuals are roughly symmetric and approximately normally distributed.

3)Homogeneity of residual variance.
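A common check for constant residual variance is a scale-location view: the square root of the absolute standardized residuals plotted against the fitted values, where an upward trend suggests the spread grows with the prediction. The sketch below illustrates the idea on simulated data that deliberately violates the assumption; the data and the rank-correlation summary are illustrative assumptions, not results from the report.

```r
# Hypothetical data whose noise grows with x, so the variance is NOT constant.
set.seed(38)
x <- runif(500, min = 500, max = 3000)
y <- 13000 + 110 * x + rnorm(500, sd = 20 * x)

fit <- lm(y ~ x)

# Scale-location quantities: sqrt(|standardized residual|) against fitted value.
scale_loc <- sqrt(abs(rstandard(fit)))
plot(fitted(fit), scale_loc,
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")

# A rank correlation well above 0 signals spread increasing with the fitted value.
spread_trend <- cor(fitted(fit), scale_loc, method = "spearman")
spread_trend
```

For the model in this report, the same view can be drawn from `model_diagnostics` by plotting `sqrt(abs(.std.resid))` against `.fitted`.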

4)Independence of residual error terms.
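One simple numeric check for independence is the Durbin-Watson statistic, which compares successive residuals; values near 2 suggest no first-order autocorrelation. The sketch below computes it by hand on simulated data with independent errors. The `durbin_watson` helper is a hypothetical name, and the statistic is only meaningful when the rows have a natural order (for example, time of sale), which the report does not establish.

```r
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by their sum of squares. Values near 2 suggest no
# first-order autocorrelation.
durbin_watson <- function(res) sum(diff(res)^2) / sum(res^2)

# Hypothetical data with independent errors by construction.
set.seed(38)
x <- runif(500, min = 500, max = 3000)
y <- 13000 + 110 * x + rnorm(500, sd = 55000)

fit <- lm(y ~ x)
dw <- durbin_watson(residuals(fit))
dw
```

For the report's model the call would be `durbin_watson(residuals(m2))`, ideally after ordering the rows by sale date.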


Data Transformation and Re-fitting the Best Model


  ames_numeric_transformed <- ames_numeric %>% mutate(Sale_Price_log10 = log10(ames_numeric$Sale_Price))
  
  
  transformed_model<- lm(Sale_Price_log10 ~ Gr_Liv_Area, data = ames_numeric_transformed)  
  summary(transformed_model)
## 
## Call:
## lm(formula = Sale_Price_log10 ~ Gr_Liv_Area, data = ames_numeric_transformed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02587 -0.06577  0.01342  0.07202  0.39231 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.855e+00  7.355e-03  660.12   <2e-16 ***
## Gr_Liv_Area 2.437e-04  4.648e-06   52.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1271 on 2928 degrees of freedom
## Multiple R-squared:  0.4842, Adjusted R-squared:  0.484 
## F-statistic:  2749 on 1 and 2928 DF,  p-value: < 2.2e-16
  ames_numeric_transformed <- augment(transformed_model)
  ames_numeric_transformed
## # A tibble: 2,930 x 8
##    Sale_Price_log10 Gr_Liv_Area .fitted  .resid     .hat .sigma    .cooksd
##               <dbl>       <int>   <dbl>   <dbl>    <dbl>  <dbl>      <dbl>
##  1             5.33        1656    5.26  0.0737 0.000374  0.127 0.0000629 
##  2             5.02         896    5.07 -0.0524 0.000828  0.127 0.0000703 
##  3             5.24        1329    5.18  0.0565 0.000380  0.127 0.0000375 
##  4             5.39        2110    5.37  0.0180 0.000839  0.127 0.00000845
##  5             5.28        1629    5.25  0.0264 0.000364  0.127 0.00000783
##  6             5.29        1604    5.25  0.0451 0.000356  0.127 0.0000224 
##  7             5.33        1338    5.18  0.148  0.000376  0.127 0.000256  
##  8             5.28        1280    5.17  0.115  0.000406  0.127 0.000166  
##  9             5.37        1616    5.25  0.125  0.000359  0.127 0.000173  
## 10             5.28        1804    5.29 -0.0183 0.000465  0.127 0.00000484
## # ... with 2,920 more rows, and 1 more variable: .std.resid <dbl>
  Equation_Model_transformed<-extract_eq(transformed_model, use_coefs = TRUE)
  Equation_Model_transformed

\[ \operatorname{\widehat{Sale\_Price\_log10}} = 4.86 + 0(\operatorname{Gr\_Liv\_Area}) \]

  ggplot(ames_numeric_transformed, aes(Gr_Liv_Area,Sale_Price_log10))+ geom_point()+geom_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

The fitted equation has an intercept of 4.86, so when above-ground living area equals 0 the predicted log10 of sale price is 4.86 on average, which corresponds to a sale price of roughly 72,000. The slope appears as 0 in the equation only because of rounding; the estimated coefficient is 2.437e-04 per square foot.
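Because the response is on the log10 scale, the coefficients are easier to read after back-transforming. The sketch below does this with the coefficients copied from the `summary(transformed_model)` output; the helper name `predict_price` is hypothetical and used only for illustration.

```r
# Coefficients copied from the summary(transformed_model) output.
intercept <- 4.855
slope     <- 2.437e-04   # change in log10(Sale_Price) per extra square foot

# Hypothetical helper: predicted sale price, undoing the log10 transformation.
predict_price <- function(gr_liv_area) 10^(intercept + slope * gr_liv_area)

# Multiplicative effect of 100 extra square feet, expressed as a percentage.
pct_per_100_sqft <- (10^(slope * 100) - 1) * 100

predict_price(1656)   # first house in the data (Gr_Liv_Area = 1656)
pct_per_100_sqft
```

Under this model, each additional 100 square feet of living area is associated with a sale price that is roughly 5.8% higher, rather than a fixed amount higher as in the untransformed model.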

1)Linearity of data.

ggplot(data = ames_numeric_transformed, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

#Description From the plot of residuals against fitted values for the transformed model, we conclude that the relationship is roughly linear.

2)normality of residuals.

ggplot(data = ames_numeric_transformed, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Description The histogram suggests that the residuals are roughly symmetric and approximately normally distributed.

Storyboard


Product type = podcast

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252   
## [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] modelsummary_0.8.1   RColorBrewer_1.1-2   scatterplot3d_0.3-41
##  [4] plotly_4.9.4.1       ggvis_0.4.7          shinyjs_2.0.0       
##  [7] shiny_1.6.0          equatiomatic_0.2.0   broom_0.7.8         
## [10] skimr_2.1.3          modeldata_0.1.1      forcats_0.5.1       
## [13] stringr_1.4.0        dplyr_1.0.7          purrr_0.3.4         
## [16] readr_1.4.0          tidyr_1.1.3          tibble_3.1.2        
## [19] ggplot2_3.3.5        tidyverse_1.3.1     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-152      fs_1.5.0          lubridate_1.7.10  webshot_0.5.2    
##  [5] httr_1.4.2        repr_1.1.3        tools_4.0.5       backports_1.2.1  
##  [9] bslib_0.2.5.1     utf8_1.2.1        R6_2.5.0          DBI_1.1.1        
## [13] lazyeval_0.2.2    mgcv_1.8-34       colorspace_2.0-2  withr_2.4.2      
## [17] tidyselect_1.1.1  compiler_4.0.5    cli_2.5.0         rvest_1.0.0      
## [21] xml2_1.3.2        labeling_0.4.2    sass_0.4.0        checkmate_2.0.0  
## [25] scales_1.1.1      tables_0.9.6      systemfonts_1.0.2 digest_0.6.27    
## [29] svglite_2.0.0     rmarkdown_2.9     base64enc_0.1-3   pkgconfig_2.0.3  
## [33] htmltools_0.5.1.1 dbplyr_2.1.1      fastmap_1.1.0     highr_0.9        
## [37] htmlwidgets_1.5.3 rlang_0.4.11      readxl_1.3.1      rstudioapi_0.13  
## [41] jquerylib_0.1.4   generics_0.1.0    farver_2.1.0      jsonlite_1.7.2   
## [45] crosstalk_1.1.1   magrittr_2.0.1    kableExtra_1.3.4  Matrix_1.3-2     
## [49] Rcpp_1.0.6        munsell_0.5.0     fansi_0.5.0       lifecycle_1.0.0  
## [53] stringi_1.6.2     yaml_2.2.1        grid_4.0.5        promises_1.2.0.1 
## [57] crayon_1.4.1      lattice_0.20-41   haven_2.4.1       splines_4.0.5    
## [61] hms_1.1.0         knitr_1.33        pillar_1.6.1      reprex_2.0.0     
## [65] glue_1.4.2        evaluate_0.14     data.table_1.14.0 modelr_0.1.8     
## [69] vctrs_0.3.8       httpuv_1.6.1      cellranger_1.1.0  gtable_0.3.0     
## [73] assertthat_0.2.1  xfun_0.24         mime_0.11         xtable_1.8-4     
## [77] later_1.2.0       viridisLite_0.4.0 ellipsis_0.3.2